A two-phase rule generation and optimization approach for wrapper generation

Authors

  • Yanan Hao
  • Yanchun Zhang
Abstract

Web information extraction is a fundamental issue for web information management and integration. A common approach is to use wrappers to extract data from web pages or documents. However, a critical issue in wrapper development is how to generate the extraction rules. In this paper, we propose a novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation. 2P-RULE consists of an internal rule optimization (IRO) process and an external rule optimization (ERO) process. In IRO, a user first creates, through a GUI, a mapping from useful values in a web page to a schema that the user specifies according to the target web information. Based on this mapping, the system automatically generates a rule list for the schema. In ERO, the user can create multiple mappings to generate further rule lists. All the acquired rule lists are then merged and refined into one optimized rule list, which is expressed in XQuery as the final extraction rules. Experiments show that the 2P-RULE approach is suitable for extracting information from web pages with complex nested structure, and can also achieve better precision and recall.
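
The mapping-to-rule-list flow described in the abstract can be illustrated with a small sketch. This is not the 2P-RULE algorithm itself, whose details the abstract does not give; it is a minimal illustration under stated assumptions. The sample page, the field names (`title`, `price`), and the helper functions (`generate_rules`, `merge_rules`, `to_path`, `extract`) are all hypothetical, and the final rules are written in the XPath subset of Python's `xml.etree.ElementTree` as a stand-in for XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (well-formed, so ElementTree can parse it).
PAGE = """<html><body>
<div><h2>Book A</h2><span>10</span></div>
<div><h2>Book B</h2><span>12</span></div>
</body></html>"""


def path_to_text(node, value, steps=()):
    """Return the indexed path, as (tag, index) steps, to the first
    element whose stripped text equals `value`, or None if absent."""
    if (node.text or "").strip() == value:
        return list(steps)
    counts = {}
    for child in node:
        # Index counts same-tag siblings, matching the tag[n] predicate.
        counts[child.tag] = counts.get(child.tag, 0) + 1
        found = path_to_text(child, value, steps + ((child.tag, counts[child.tag]),))
        if found is not None:
            return found
    return None


def generate_rules(root, mapping):
    """One rule list per mapping: schema field -> indexed path."""
    return {field: path_to_text(root, value) for field, value in mapping.items()}


def merge_rules(rules_a, rules_b):
    """Merge two rule lists by dropping every index on which they disagree."""
    merged = {}
    for field, path_a in rules_a.items():
        path_b = rules_b[field]
        if [t for t, _ in path_a] != [t for t, _ in path_b]:
            raise ValueError(f"paths for {field!r} differ in structure")
        merged[field] = [(t, i if i == j else None)
                         for (t, i), (_, j) in zip(path_a, path_b)]
    return merged


def to_path(steps):
    """Render steps as an ElementTree path; None means 'any position'."""
    return "./" + "/".join(t if i is None else f"{t}[{i}]" for t, i in steps)


def extract(root, rules):
    """Apply every rule to the page and collect the matched text values."""
    return {field: [e.text for e in root.findall(to_path(steps))]
            for field, steps in rules.items()}


root = ET.fromstring(PAGE)
# Two user mappings from page values to the same schema fields.
rules_a = generate_rules(root, {"title": "Book A", "price": "10"})
rules_b = generate_rules(root, {"title": "Book B", "price": "12"})
merged = merge_rules(rules_a, rules_b)
print(to_path(merged["title"]))
print(extract(root, merged))
```

On this sample page each single mapping yields a rule that matches only one record, while the merged rule generalizes over the `div` position and matches every record. That is the intuition behind combining several mappings, though the refinement step in the paper may work quite differently.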


Similar resources

Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection

Feature selection for various applications has been carried out for many years in many different research areas. However, there is a trade-off between finding feature subsets with minimum length and increasing the classification accuracy. In this paper, a filter-wrapper feature selection approach based on fuzzy-rough gain ratio is proposed to tackle this problem. As a search strategy, a modifie...


Improvement of Rule Generation Methods for Fuzzy Controller

This paper proposes fuzzy modeling using obtained data. Fuzzy system is known as knowledge-based or rule-bases system. The most important part of fuzzy system is rule-base. One of problems of generation of fuzzy rule with training data is inconsistence data. Existence of inconsistence and uncertain states in training data causes high error in modeling. Here, Probability fuzzy system presents to...


Optimal Locating and Sizing of Unified Power Quality Conditioner-phase Angle Control for Reactive Power Compensation in Radial Distribution Network with Wind Generation

In this article, a multi-objective planning is demonstrated for reactive power compensation in radial distribution networks with wind generation via unified power quality conditioner (UPQC). UPQC model, based on phase angle control (PAC), is used. In presented method, optimal locating of UPQC-PAC is done by simultaneous minimizing of objective functions such as: grid power loss, percentage of n...


Three-dimensional CFD modeling of fluid flow and heat transfer characteristics of Al2O3/water nanofluid in microchannel heat sink with Eulerian-Eulerian approach

In this paper, three-dimensional incompressible laminar fluid flow in a rectangular microchannel heat sink (MCHS) using Al2O3/water nanofluid as a cooling fluid is numerically studied. CFD prediction of fluid flow and forced convection heat transfer properties of nanofluid using single-phase and two-phase model (Eulerian-Eulerian approach) are compared. Hydraulic and thermal performance of microch...


Combined Use of Sensitivity Analysis and Hybrid Wavelet-PSO-ANFIS to Improve Dynamic Performance of DFIG-Based Wind Generation

In the past few decades, increasing growth of wind power plants causes different problems for the power quality in the grid. Normal and transient impacts of these units on the power grid clearly indicate the need to improve the quality of the electricity generated by them in the design of such systems. Improving the efficiency of the large-scale wind system is dependent on the control parameter...




Publication date: 2006